Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering
We propose an unsupervised strategy for the selection of justification
sentences for multi-hop question answering (QA) that (a) maximizes the
relevance of the selected sentences, (b) minimizes the overlap between the
selected facts, and (c) maximizes the coverage of both question and answer.
This unsupervised sentence selection method can be coupled with any supervised
QA approach. We show that the sentences selected by our method improve the
performance of a state-of-the-art supervised QA model on two multi-hop QA
datasets: AI2's Reasoning Challenge (ARC) and Multi-Sentence Reading
Comprehension (MultiRC). We obtain new state-of-the-art performance on both
datasets among approaches that do not use external resources for training the
QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1%
EM0 on MultiRC. Our justification sentences have higher quality than the
justifications selected by a strong information retrieval baseline, e.g., by
5.4% F1 in MultiRC. We also show that our unsupervised selection of
justification sentences is more stable across domains than a state-of-the-art
supervised sentence selection method.
Comment: Published at EMNLP-IJCNLP 2019 as a long conference paper. Corrected
the name reference for Speer et al., 201
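The three selection criteria in the abstract above (relevance, low overlap, coverage) can be illustrated with a small sketch. This is not the authors' implementation: the whitespace tokenizer, the term-count relevance score, the Jaccard overlap, and the way the three terms are combined in `score_group` are all assumptions made for illustration.

```python
from itertools import combinations

def tokens(text):
    """Lowercased whitespace tokens as a set (a crude stand-in for real preprocessing)."""
    return set(text.lower().split())

def score_group(group, question, answer):
    """Score a group of candidate sentences; the combination formula is hypothetical."""
    q, a = tokens(question), tokens(answer)
    query = q | a
    sents = [tokens(s) for s in group]
    # (a) relevance: query terms shared with each sentence (stand-in for an IR score)
    relevance = sum(len(s & query) for s in sents)
    # (b) overlap: mean Jaccard similarity between pairs of selected sentences
    pairs = list(combinations(sents, 2))
    overlap = (sum(len(x & y) / max(len(x | y), 1) for x, y in pairs) / len(pairs)
               if pairs else 0.0)
    # (c) coverage: fraction of question terms plus fraction of answer terms covered
    covered = set().union(*sents)
    coverage = len(covered & q) / max(len(q), 1) + len(covered & a) / max(len(a), 1)
    return relevance * coverage / (1.0 + overlap)

def select_justifications(candidates, question, answer, k=2):
    """Return the k-sentence group with the highest score (exhaustive search)."""
    return max(combinations(candidates, k),
               key=lambda g: score_group(g, question, answer))

candidates = [
    "plants make food using sunlight",
    "plants make food using sunlight energy",
    "sunlight is energy from the sun",
    "dogs bark loudly",
]
best = select_justifications(
    candidates, "what do plants use to make food", "energy from the sun", k=2)
print(best)
```

Because overlap is penalized, the two near-duplicate sentences about plants score poorly as a pair, even though each is individually relevant.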
Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering
Evidence retrieval is a critical stage of question answering (QA), necessary
not only to improve performance, but also to explain the decisions of the
corresponding QA method. We introduce a simple, fast, and unsupervised
iterative evidence retrieval method, which relies on three ideas: (a) an
unsupervised alignment approach to soft-align questions and answers with
justification sentences using only GloVe embeddings, (b) an iterative process
that reformulates queries focusing on terms that are not covered by existing
justifications, and (c) a stopping criterion that terminates retrieval when
the terms in the given question and candidate answers are covered by the
retrieved justifications. Despite its simplicity, our approach outperforms all
the previous methods (including supervised methods) on the evidence selection
task on two datasets: MultiRC and QASC. When these evidence sentences are fed
into a RoBERTa answer classification component, we achieve state-of-the-art QA
performance on these two datasets.
Comment: Accepted at ACL 2020 as a long conference paper
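The iterative loop and stopping criterion described in the abstract above can be sketched as follows. This is a simplified illustration, not the published method: exact lexical matching replaces the GloVe-based soft alignment, and the function name `iterative_retrieval` and the `max_hops` parameter are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def iterative_retrieval(question, answer, sentences, max_hops=5):
    """Greedily retrieve sentences until every query term is covered.

    Assumption: exact token matching stands in for soft alignment with embeddings.
    Returns the retrieved chain and any query terms left uncovered.
    """
    remaining = tokens(question) | tokens(answer)  # terms not yet covered
    pool = list(sentences)
    chain = []
    for _ in range(max_hops):
        # Stopping criterion: all question and answer terms are covered.
        if not remaining or not pool:
            break
        # Reformulated query = uncovered terms only; pick the best-matching sentence.
        best = max(pool, key=lambda s: len(tokens(s) & remaining))
        gained = tokens(best) & remaining
        if not gained:
            break  # no sentence covers anything new
        chain.append(best)
        pool.remove(best)
        remaining -= gained
    return chain, remaining

chain, uncovered = iterative_retrieval(
    "plants sunlight", "energy sun",
    ["plants absorb sunlight", "the sun provides energy", "dogs bark"])
print(chain, uncovered)
```

Each hop queries only for terms the current chain has not yet covered, which is what pushes the retrieval toward complementary sentences.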
A Bootstrapping architecture for time expression recognition in unlabelled corpora via syntactic-semantic patterns
In this paper we describe a semi-supervised approach to the extraction of time expression mentions in large unlabelled corpora based on bootstrapping.
Bootstrapping techniques rely on a relatively small amount of initial human-supplied examples (termed “seeds”) of the type of entity or concept to be learned, in order to capture an initial set of patterns or rules from the unlabelled text that extract the supplied data. In turn, the learned patterns are employed to find new potential examples, and the process is repeated to grow the set of patterns and (optionally) the set of examples. In order to prevent the learned pattern set from producing spurious results, it becomes essential
to implement a ranking and selection procedure to filter out “bad” patterns and, depending on the case, new candidate examples. Therefore, the type of patterns employed (knowledge representation) as well as the ranking and selection procedure are paramount to the quality of the results. We present a complete bootstrapping algorithm for recognition of time expressions, with a special emphasis on the type of patterns used (a combination of semantic and morpho-syntactic elements) and the ranking and selection criteria.
Bootstrapping techniques have been previously employed with limited success for several NLP problems, both of recognition and classification, but their application to time expression recognition is, to the best of our knowledge, novel. As of this writing, the described architecture is in the final stages of implementation, with experimentation and evaluation already underway.
Postprint (published version)
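The generic bootstrapping cycle described above (seeds, pattern induction, pattern ranking, example growth) can be sketched in a few lines. This is a toy illustration under strong assumptions, not the architecture from the paper: patterns are just (previous word, next word) contexts rather than syntactic-semantic patterns, candidates are single tokens, and the `min_precision` filter is a hypothetical ranking criterion.

```python
from collections import defaultdict

def bootstrap(corpus, seeds, rounds=2, min_precision=0.5):
    """Grow a set of examples from seeds using simple context patterns."""
    examples = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # 1. Record which tokens occur in each (previous word, next word) context.
        matches = defaultdict(set)
        for sentence in corpus:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            for i in range(1, len(words) - 1):
                matches[(words[i - 1], words[i + 1])].add(words[i])
        # 2. Rank and select: keep contexts whose extractions are mostly known examples.
        patterns = {
            p for p, found in matches.items()
            if len(found & examples) / len(found) >= min_precision
        }
        # 3. Grow: everything extracted by a selected pattern becomes an example.
        new_examples = set().union(*(matches[p] for p in patterns))
        if new_examples <= examples:
            break  # nothing new was learned
        examples |= new_examples
    return examples, patterns

corpus = [
    "meet on monday morning",
    "meet on tuesday morning",
    "call on friday morning",
    "sit on chairs quietly",
]
examples, patterns = bootstrap(corpus, seeds={"monday", "tuesday"})
print(sorted(examples), patterns)
```

The precision filter in step 2 plays the role of the ranking and selection procedure: without it, any context that ever contained a seed would be trusted, and spurious extractions would accumulate across rounds.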
It is not Sexually Suggestive, It is Educative. Separating Sex Education from Suggestive Content on TikTok Videos
We introduce SexTok, a multi-modal dataset composed of TikTok videos labeled
as sexually suggestive (from the annotator's point of view), sex-educational
content, or neither. Such a dataset is necessary to address the challenge of
distinguishing between sexually suggestive content and virtual sex education
videos on TikTok. Children's exposure to sexually suggestive videos has been
shown to have adverse effects on their development. Meanwhile, virtual sex
education, especially on subjects that are more relevant to the LGBTQIA+
community, is very valuable. The platform's current system removes or penalizes
some of both types of videos, even though they serve different purposes. Our
dataset contains video URLs and audio transcriptions. To validate its
importance, we explore two transformer-based models for classifying the videos.
Our preliminary results suggest that the task of distinguishing between these
types of videos is learnable but challenging. These experiments suggest that
this dataset is meaningful and invites further study on the subject.
Comment: Accepted to ACL Findings 2023. 10 pages, 3 figures, 5 tables. Please
refer to https://github.com/enfageorge/SexTok for dataset and related details
Time Travel in LLMs: Tracing Data Contamination in Large Language Models
Data contamination, i.e., the presence of test data from downstream tasks in
the training data of large language models (LLMs), is a potential major issue
in understanding LLMs' effectiveness on other tasks. We propose a
straightforward yet effective method for identifying data contamination within
LLMs. At its core, our approach starts by identifying potential contamination
in individual instances that are drawn from a small random sample; using this
information, our approach then assesses if an entire dataset partition is
contaminated. To estimate contamination of individual instances, we employ
"guided instruction": a prompt consisting of the dataset name, partition type,
and the initial segment of a reference instance, asking the LLM to complete it.
An instance is flagged as contaminated if the LLM's output either exactly or
closely matches the latter segment of the reference. To understand if an entire
partition is contaminated, we propose two ideas. The first idea marks a dataset
partition as contaminated if the average overlap score with the reference
instances (as measured by ROUGE or BLEURT) is statistically significantly
better with the guided instruction vs. a general instruction that does not
include the dataset and partition name. The second idea marks a dataset as
contaminated if a classifier based on GPT-4 with in-context learning prompting
marks multiple instances as contaminated. Our best method achieves an accuracy
between 92% and 100% in detecting if an LLM is contaminated with seven
datasets, containing train and test/validation partitions, when contrasted with
manual evaluation by human experts. Further, our findings indicate that GPT-4 is
contaminated with the AG News, WNLI, and XSum datasets.
Comment: v1 preprint
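The contrast between a guided and a general instruction can be sketched as below. Everything here is an assumption for illustration: the prompt wording is hypothetical, `overlap` is a crude token-recall stand-in for ROUGE or BLEURT, `complete` is any caller-supplied function that maps a prompt to a model completion, and the statistical significance test from the abstract is omitted in favor of a plain mean difference.

```python
def guided_prompt(dataset, split, first_piece):
    """Prompt that names the dataset and partition (wording is hypothetical)."""
    return (f"Below is the first piece of an instance from the {split} split "
            f"of the {dataset} dataset. Finish it exactly as it appears there.\n"
            f"{first_piece}")

def general_prompt(first_piece):
    """Prompt with no dataset or partition name (wording is hypothetical)."""
    return f"Finish the following text.\n{first_piece}"

def overlap(candidate, reference):
    """Fraction of reference tokens found in the candidate (stand-in for ROUGE/BLEURT)."""
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    return sum(t in cand for t in ref) / len(ref) if ref else 0.0

def mean_overlap_gap(instances, dataset, split, complete):
    """Average (guided overlap - general overlap) over (first, second) piece pairs."""
    gaps = []
    for first_piece, second_piece in instances:
        guided = overlap(complete(guided_prompt(dataset, split, first_piece)), second_piece)
        general = overlap(complete(general_prompt(first_piece)), second_piece)
        gaps.append(guided - general)
    return sum(gaps) / len(gaps)

# Toy stand-in for an LLM: it "recalls" the reference only when the dataset is named.
memorized = {"the cat sat": "on the mat"}
def fake_llm(prompt):
    first_piece = prompt.splitlines()[-1]
    return memorized.get(first_piece, "") if "dataset" in prompt else "something unrelated"

print(mean_overlap_gap([("the cat sat", "on the mat")], "ToyData", "test", fake_llm))
```

A large positive gap suggests the model reproduces the reference only when told where it came from; a real analysis would test whether that gap is statistically significant over the sampled instances.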
Analyzing the Language of Food on Social Media
We investigate the predictive power behind the language of food on social
media. We collect a corpus of over three million food-related posts from
Twitter and demonstrate that many latent population characteristics can be
directly predicted from this data: overweight rate, diabetes rate, political
leaning, and home geographical location of authors. For all tasks, our
language-based models significantly outperform the majority-class baselines.
Performance is further improved with more complex natural language processing,
such as topic modeling. We analyze which textual features have the most predictive
power for these datasets, providing insight into the connections between the
language of food, geographic locale, and community characteristics. Lastly, we
design and implement an online system for real-time query and visualization of
the dataset. Visualization tools, such as geo-referenced heatmaps,
semantics-preserving wordclouds and temporal histograms, allow us to discover
more complex, global patterns mirrored in the language of food.
Comment: An extended abstract of this paper will appear in IEEE Big Data 201